adding gujarati vocabulary dec 4 #1811

sarjil77 · 2024-12-03T19:11:23Z

Here I am adding the Gujarati vocabulary.

felixdittrich92 · 2024-12-04T09:17:09Z

doctr/datasets/vocabs.py

@@ -22,6 +22,13 @@
    "hindi_letters": "अआइईउऊऋॠऌॡएऐओऔअंअःकखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह",
    "hindi_digits": "०१२३४५६७८९",
    "hindi_punctuation": "।,?!:्ॐ॰॥॰",
+    "gujarati_vowels": "અઆઇઈઉઊઋએઐઓઔઅંઅઃ ",
+    "gujarati_digits":"૦૧૨૩૪૫૬૭૮૯",
+    "gujarati_diacritics_consonants":"""કકાકિકીકુકૂકૃકેકૈકોકૌકંકઃખખાખિખીખુખૂખૃખેખૈખોખૌખંખઃગગાગિગીગુગૂગૃગેગૈગોગૌગંગઃઘઘાઘિઘીઘુઘૂઘૃઘેઘૈઘોઘૌઘંઘઃઙઙાઙિઙીઙુઙૂઙૃઙેઙૈઙોઙૌઙંઙઃચચાચિચીચુચૂચૃચેચૈચોચૌચંચઃછછાછિછીછુછૂછૃછેછૈછોછૌછંછઃ


@sarjil77 I tested a bit with your added vocab, which raised some issues because we can't encode it. If I understood it correctly, the leading letter combined with the dotted-circle sign (for example: કૌ) is rendered as one character, but programmatically it's counted as 2 characters. Is there any way to make these strings Unicode-conformant?

The goal is that, in the end, each character in an image corresponds to exactly 1 encoded character.

If I filter your diacritics I get the following:

ઃકખગઘઙચછજઝઞટઠડઢણતથદધનપફબભમયરલવશષાિીુૂૃેૈોૌ્

Btw, in a multiline string each line needs to end with \, otherwise the line break is counted as part of the string.
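Both points in this comment can be checked in plain Python. The following is a minimal sketch, not code from the PR; the variable names are made up for illustration, and the characters are written as escapes so the code points are explicit.

```python
# A base consonant followed by a dependent vowel sign renders as one glyph
# but is stored as two Unicode code points.
syllable = "\u0a95\u0acc"  # GUJARATI LETTER KA + GUJARATI VOWEL SIGN AU
code_point_count = len(syllable)

# In a triple-quoted string, a raw line break becomes a newline character...
with_newline = """\u0a95
\u0a96"""

# ...unless the line ends with a backslash, which joins it to the next line.
without_newline = """\u0a95\
\u0a96"""

print(code_point_count)      # 2
print(len(with_newline))     # 3 (KA, newline, KHA)
print(len(without_newline))  # 2 (KA, KHA)
```

A stray newline inside a vocab string would therefore become one more "character" of the vocabulary.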

@sarjil77 Something like this:

"gujarati_letters": "તગખઢરજયશઆઐઊૂેપફુ્ઓૈાથીડૃદઠવનલષકિઅભઘઉઔઝઙઇઞઈધૌછટચબોમએણઋ", "gujarati_digits":"૦૧૨૩૪૫૬૭૮૯", "gujarati_punctuation": "૰ઽ◌ંઃ॥ૐ" + "૱",

length: 103
all chars: તગખઢરજયશઆઐઊૂેપફુ્ઓૈાથીડૃદઠવનલષકિઅભઘઉઔઝઙઇઞઈધૌછટચબોમએણઋ૦૧૨૩૪૫૬૭૮૯૰ઽ◌ંઃ॥ૐ૱!"#$%&'()*+,-./:;<=>?@[]^_`{|}~

? Not sure anyway 😅

This is what I get if I deduplicate it in Python.

The single diacritics (as an addition to a char) are counted as standalone symbols.

Thanks @felixdittrich92, noted. I am not sure right now, but I will look further into this.
:)

Hello @felixdittrich92, you are right, it is counted as more than one character: for example "ફુ્", a letter with diacritics, takes 9 bytes in UTF-8. To treat a letter together with its diacritics as a single character, we could try NFC (Normalization Form C), which composes a character and its diacritics into a single code point where a precomposed form exists and otherwise leaves the sequence as separate code points.

For example:
import unicodedata

txt = "ફુ્"  # a letter followed by two diacritics

# UTF-8 bytes of the string
encoded_string = txt.encode()

# NFC normalization (composes only where a precomposed code point exists)
normalized_text = unicodedata.normalize("NFC", txt)

print("encoded string is:", encoded_string)
print(f"the length of encoded string is: {len(encoded_string)}")
print("normalized_text is:", normalized_text)
# len() of a str counts code points, not bytes
print(f"the length of normalized string (code points) is: {len(normalized_text)}")

output:
encoded string is: b'\xe0\xaa\xab\xe0\xab\x81\xe0\xab\x8d'
the length of encoded string is: 9
normalized_text is: ફુ્
the length of normalized string (code points) is: 3

Please have a look at this. I do not know how other people have added diacritics; we could also add just the consonants and vowels, but that would not make any sense.

Let me know your thoughts.
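As the output above shows, the string still has 3 code points after NFC. If the aim is to count what a reader sees as one character, one option is to group each base character with the combining marks that follow it. The sketch below is a simplified approximation under that assumption (it is not full grapheme-cluster segmentation, and `group_with_marks` is a hypothetical helper, not part of doctr):

```python
import unicodedata


def group_with_marks(text: str) -> list[str]:
    """Attach each combining mark (category Mn/Mc/Me) to the preceding character."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # mark: extend the current cluster
        else:
            clusters.append(ch)  # base character: start a new cluster
    return clusters


# PHA + VOWEL SIGN U + VIRAMA, the same sequence as in the example above
sample = "\u0aab\u0ac1\u0acd"
nfc = unicodedata.normalize("NFC", sample)
clusters = group_with_marks(sample)

print(len(sample), len(nfc), len(clusters))  # 3 3 1
```

A vocabulary made of single code points can still encode such text, because each mark is then treated as its own symbol.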

felixdittrich92 · 2024-12-06T13:54:36Z

@sarjil77 Take a look here; it should be enough to copy-paste these changes: main...felixdittrich92:doctr:gujarati-vocab-test

Tested with your "full" vocab: I can completely reproduce it, so all chars are added and a string can be encoded char by char :)

Before that, you should pull the latest changes from main and rebase your branch :)
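"Encoded char by char" can be pictured as a lookup of every code point in the vocab string. This is a minimal sketch under that assumption; `encode_by_code_point` and `toy_vocab` are hypothetical names for illustration and do not reproduce doctr's own encoding function.

```python
def encode_by_code_point(text: str, vocab: str) -> list[int]:
    """Map each code point of `text` to its index in `vocab` (hypothetical helper)."""
    index = {ch: i for i, ch in enumerate(vocab)}
    missing = sorted(set(text) - set(vocab))
    if missing:
        raise ValueError(f"characters not in vocab: {missing}")
    return [index[ch] for ch in text]


# Illustrative subset: KA, KHA, VOWEL SIGN AU
toy_vocab = "\u0a95\u0a96\u0acc"

# KA + VOWEL SIGN AU is encoded as two indices, one per code point
encoded = encode_by_code_point("\u0a95\u0acc", toy_vocab)
print(encoded)  # [0, 2]
```

With this scheme a precombined entry such as a letter plus its sign would never be looked up as a unit, which is why the vocab only needs each letter and each diacritic once.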

sarjil77 · 2024-12-06T16:58:08Z

@felixdittrich92 The Gujarati letters you provided in your commits contain a significant portion of the Gujarati alphabet but are not entirely complete: you are missing 2 major vowels and 3 consonants, so 5 letters in total. Based on the diacritics which I provided before you are right, but these 5 letters are additional ones that don't have diacritics, so I missed them (sorry). They are also frequently used, so we can't ignore them. I would also strongly recommend keeping vowels and consonants separate; it will be better for traceability and may help in the future.

felixdittrich92 · 2024-12-06T17:26:46Z

Then I would say feel free to add the missing ones. What you see is your vocab, but deduplicated :)
I simply did "".join(sorted(set(VOCABS["gujarati_letters"]))) to filter them.
Also, splitting them would be fine on my end, feel free 👍🏼
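The deduplication described here can be tried on a small string. The sketch below uses made-up sample data; it also shows an order-preserving variant with `dict.fromkeys`, which is an alternative and not what was used in the PR.

```python
# KHA + AA, KA + AA: the vowel sign AA appears twice
raw = "\u0a96\u0abe\u0a95\u0abe"

# Approach from the comment: unique characters, sorted by code point
sorted_unique = "".join(sorted(set(raw)))

# Alternative: unique characters in order of first appearance
ordered_unique = "".join(dict.fromkeys(raw))

print(len(raw), len(sorted_unique), len(ordered_unique))  # 4 3 3
print(sorted_unique == ordered_unique)                    # False
```

Both results contain the same characters; only the order differs, which matters if the position of a character in the vocab is used as its class index.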

felixdittrich92 · 2024-12-06T17:30:47Z

@sarjil77 Please don't forget to rebase first; in the meantime I added a test case to check the VOCABS entry values for duplicates :)
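A duplicate check of this kind might look roughly like the following. This is only a sketch of the idea, not the test that was added to the repository; `find_duplicates` and `toy_vocabs` are hypothetical.

```python
from collections import Counter


def find_duplicates(vocabs: dict[str, str]) -> dict[str, list[str]]:
    """Return, per entry name, the characters that occur more than once."""
    report: dict[str, list[str]] = {}
    for name, chars in vocabs.items():
        repeated = [ch for ch, n in Counter(chars).items() if n > 1]
        if repeated:
            report[name] = repeated
    return report


# Illustrative data, not doctr's VOCABS
toy_vocabs = {
    "digits": "0123456789",
    "broken": "abca",
}

duplicates = find_duplicates(toy_vocabs)
print(duplicates)  # {'broken': ['a']}
```

In a test suite, asserting that the report is empty would fail as soon as an entry repeats a character.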

sarjil77 · 2024-12-06T18:07:04Z

@felixdittrich92 i think done from my side :) haa,

thanks.

felixdittrich92 · 2024-12-06T18:23:16Z

@sarjil77 Your branch still needs to be rebased (see: there is a conflicting file) :)

1. Update your fork
2. Check out main and pull
3. Check out your branch and pull / rebase
4. Then force-push your changes

Additionally, the docs entry is missing; take a look at my provided branch :)

sarjil77 · 2024-12-06T19:33:24Z

i think now it is good. :)

felixdittrich92 · 2024-12-06T19:55:26Z

@sarjil77 it's still not rebased on main :)
See:

This branch has conflicts that must be resolved
Use the web editor or the command line to resolve conflicts.
Conflicting files
doctr/datasets/vocabs.py

And about the documentation entry: if you have added more chars, I think the number and the char string have changed as well ;) so please fix this :D

changes with vocab and documentation dec 10

updating both vocab and documentation

sarjil77 · 2024-12-09T19:25:51Z

Hey @felixdittrich92, I think I have changed both the documentation and the vocab. Can you please look into it? From my side, I have checked for any conflicts. :)

felixT2K · 2024-12-10T15:06:22Z

@sarjil77 Looks like you merged the main branch into your feature branch instead of rebasing your branch on main 😅
See: https://github.com/mindee/doctr/pull/1811/files (it includes some previously merged commits)

2 options:

1. Rebase your branch on main: git checkout <YOUR_FEATURE_BRANCH> -> git rebase main (preferred option)
2. Close this PR, create a new feature branch, add your changes, and create a fresh PR

👍🏼

sarjil77 · 2024-12-11T05:51:18Z

Okay, I am closing this PR and will do as you suggested.
How can I be so stupid; let me correct it. I thought it was okay, sorry. :D

felixdittrich92 · 2024-12-11T07:48:00Z

Don't worry, these are the little things you grow from, believe me :) Without getting things wrong we wouldn't learn ^^

adding gujarati vocabulry dec 4

4b03ffa

felixdittrich92 reviewed Dec 4, 2024

updated gujarati vocab dec 6

90707c0

updated docs and gujarati langugae vocab

f88a316

felixdittrich92 and others added 4 commits December 10, 2024 00:24

[Feat] Add torch.compile support (mindee#1791)

394661f

[Bug] Fix vocabs and add corresponding test case (mindee#1813)

bc1837e

changes with vocab and documentation dec 10

updates dataset.srt

8fba03c

updating both vocab and documentation

sarjil77 closed this Dec 11, 2024
